Visual Tracking with Similarity Matching Ratio
Abstract: This paper presents a novel approach to visual tracking: the Similarity Matching Ratio (SMR). The traditional
approach to tracking minimizes some measure of the difference between the template and a patch from
the frame. This approach is vulnerable to outliers and drastic appearance changes, and extensive research has
focused on making it more tolerant to them. However, this often results in longer, corrective algorithms
that do not solve the original problem. This paper proposes a new definition of the tracking problem,
SMR, which turns the differences into a probability measure. Only pixel differences below
a threshold count towards deciding the match; the rest are ignored. This makes the SMR tracker
robust to outliers and to points that dramatically change appearance. The SMR tracker was tested on challenging
video sequences and achieved state-of-the-art performance.



1 INTRODUCTION 

Visual tracking of objects in a scene is a very important
component of a unified robotic vision system. Robots need to
track objects in order to interact with them. Moreover, as they
move closer, robots and other autonomous vehicles will have to
avoid other moving objects, humans and animals as they operate
in our everyday environment.

The object-tracking performance of the human visual system is
currently unsurpassed by engineered systems, so our research takes
inspiration from, and tries to reverse-engineer, the known principles
of cortical processing during visual tracking. Visual tracking is
a complex task, and neuroscience studies of cortical processing
paint an incomplete picture that can only partially guide the design
of a synthetic solution. Nevertheless, a few key features arise from
studying the human visual system and its tracking abilities: (1) the
human visual system is not limited to conventional three-dimensional
objects in space, but is able to track a set of visual features
(Blaser et al., 2000). Thus "object" in this paper refers to a
distinct group of features in two-dimensional space. (2) Humans do
not need knowledge of the object class before visual tracking,
and (3) humans can track an object after a very brief presentation.
Even though the human visual system does not operate with frames,
it is commonly desired that synthetic systems be able to track from
a single frame, or just a few (tens).

Visual tracking in artificial systems has been studied for decades,
with laudable results (Yilmaz et al., 2006). In this paper we focus
on bio-inspired visual tracking systems that can be part of a unified
neurally-inspired vision system. Ideally, a unified visual model
would be able to parse and detect an object in every frame, but at
present no bio-inspired model can do this in real time
(DiCarlo et al., 2012; LeCun et al., 2004; Serre et al., 2007).
Deep neural networks come close to this performance when trained
to look for a single object on a large collection of images
(Sermanet et al., 2011).

When we think of visual tracking we often have in mind a familiar
object in space. But humans are able to track any localized variation
in a 2D field, such as a set of features (Blaser et al., 2000). It is
a high-SNR peak detector that allows us to track a puff of smoke or
a cloud, for example. A bio-inspired synthetic visual tracker is
generally thought of as having two outputs of the same unified stream:
one is a deep neural network classifier that is capable of categorizing
objects, the other is a shallower classifier that can group features
into objectness. The deep system is used to continue tracking an
object as it disappears and reappears in the scene, while the shallow
system provides rapid grouping of local features by tracking local
maxima in the retinal space. Such a distinction might be necessary
because a deep system will need 100-200 ms to process one visual scene
(Thorpe et al., 1996), while tracking without predicting object
movement, such as that required for the oculo-motor control of smooth
pursuit (Wilmer and Nakayama, 2007), requires faster processing of
the visual stream.

Table 1: Properties of the video dataset used in this work (Kalal et al., 2010a).

Video Sequence        1. David   2. Jumping   3. Pedestrian1   4. Pedestrian2   5. Pedestrian3   6. Car
Number of Frames      761        313          140              338              184              945
Camera Movement       yes        yes          yes              yes              yes              yes
Partial Occlusion     yes        no           no               yes              yes              yes
Full Occlusion        no         no           no               yes              yes              yes
Pose Change           yes        no           no               no               no               no
Illumination Change   yes        no           no               no               no               no
Scale Change          yes        no           no               no               no               no
Similar Objects       no         no           no               yes              yes              yes

Table 2: Number of correctly tracked frames for the state-of-the-art trackers and the SMR tracker. Table is taken and modified
from (Kalal et al., 2010b).

Video Sequence           1. David   2. Jumping   3. Pedestrian1   4. Pedestrian2   5. Pedestrian3   6. Car
Number of Frames         761        313          140              338              184              945
(Lim et al., 2004)       17         75           11               33               50               163
(Collins et al., 2005)   n/a        313          6                8                5                n/a
(Avidan, 2007)           94         44           22               118              53               10
(Babenko et al., 2009)   135        313          101              37               49               45
(Kalal et al., 2010b)    761        170          140              97               52               510
SMR (this work)          761        313          140              236              66               510

Inspired by recent findings on shallow feature extractors of the
visual cortex (Vintch et al., 2010), we postulate that simple tracking
processes are based on a shallow neural network that can quickly
identify similarities between object features repeated in time.
We propose an algorithm that can track and extract the motion of an
object based on the similarity between local features observed in
subsequent frames. The local features are initially defined by a
bounding box that delimits the object to track.

Traditional template matching algorithms define the tracking problem
as follows: we are given two images F(x,y) and G(x,y), which represent
the pixel values at each location (x,y). We want to find the distance
vector (h_1, h_2) that minimizes some measure of the difference between
F(x + h_1, y + h_2) and G(x,y) (Lucas and Kanade, 1981). The measure can
be based on cross correlation, image intensity, color features, image
gradients or color histograms. However, this traditional definition of
tracking suffers from outliers and from regions that drastically change
their appearance or disappear from the scene.



In our work we change this definition of tracking and propose a
novel approach, the Similarity Matching Ratio (SMR). Instead of trying
to minimize some measure of the difference between F(x + h_1, y + h_2)
and G(x,y), we want to find the (h_1, h_2) that gives the best match
ratio between F(x + h_1, y + h_2) and G(x,y). To do this, we turn the
differences into probability values and accumulate them over every
pixel that has a good match. If there is no good match between
F(x + h_1, y + h_2) and G(x,y) at a pixel, the difference contributes
zero probability, because we are not interested in how badly the two
pixels match. This approach is more robust to appearance change,
disappearance and outliers. The method was tested on challenging
benchmark video sequences that include camera movement, partial and
full occlusion, illumination change, scale change and similar objects.
State-of-the-art performance was achieved on these video sequences.



2 PREVIOUS WORK 



Most popular trackers that are based on the traditional definition
of the tracking problem (e.g. Sum-of-Squared-Distances (SSD),
Sum-of-Absolute-Differences (SAD), the Lucas-Kanade tracker) try to
find the distance vector (h_1, h_2) that minimizes the difference
between F(x + h_1, y + h_2) and G(x,y), either on the grayscale or on
the color image. However, the template G(x,y) may include outliers,
or parts that dramatically change or disappear, which cause tracking
failure. The common remedy is for trackers not to treat all pixels
uniformly but to eliminate outliers from the computation.
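
As a point of reference, the traditional formulation can be sketched as an exhaustive SAD search. This is a minimal illustration in Python with NumPy, not the implementation of any of the cited trackers; the function name `sad_search`, the `radius` parameter and the convention that positions are top-left corners (x, y) are assumptions made for the example.

```python
import numpy as np

def sad_search(frame, template, prev_xy, radius):
    """Exhaustive SAD search around prev_xy, the previous top-left corner (x, y).

    frame and template are 2-D grayscale arrays. Returns the best (x, y) and
    its sum of absolute differences.
    """
    th, tw = template.shape
    fh, fw = frame.shape
    template = template.astype(np.float64)
    px, py = prev_xy
    best_score, best_xy = np.inf, prev_xy
    for y in range(max(0, py - radius), min(fh - th, py + radius) + 1):
        for x in range(max(0, px - radius), min(fw - tw, px + radius) + 1):
            patch = frame[y:y + th, x:x + tw].astype(np.float64)
            # Every pixel contributes, so a few outliers can dominate the score.
            score = np.abs(patch - template).sum()
            if score < best_score:
                best_score, best_xy = score, (x, y)
    return best_xy, best_score
```

Because every pixel difference enters the sum, a small region with very large differences can outweigh many well-matching pixels, which is the weakness the approaches below try to address.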



Some studies (Comaniciu et al., 2003; Shi and Tomasi, 1994) proposed
using a weighted histogram as the measure to minimize for tracking.
Assuming that pixels close to the center are the most reliable, these
methods weight them higher, since occlusions and interference tend to
occur close to the boundaries. However, a dramatic change in
appearance can also occur at the center, which these methods cannot
handle.

Other studies aim to detect outliers and suppress them from the
computation. Hager and Belhumeur (1998) use the common observation
that outliers produce large image differences that can be detected by
the estimation process (Black and Jepson, 1998). Residuals are
calculated iteratively, and if the variations of a residual are larger
than a user-defined threshold, the corresponding points are considered
outliers and suppressed. Ishikawa et al. (2002) use the spatial
coherence of outliers: outliers tend to form a spatially coherent
group rather than being randomly distributed across the template.
In that work the template is divided into blocks and constant weights
are assigned to each block. If the image differences of a block
between frames are large, the block contains a significant amount of
outliers. The method excludes the blocks that contain outliers from
the minimization. These methods are robust to outliers; however, they
are computationally expensive.
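
The block-wise idea can be illustrated with a short sketch. This is a simplified illustration only and is not the method of Ishikawa et al. (2002): the function name `blockwise_sad`, the block size, the 0/1 block weights, the `block_threshold` value and the normalization by the number of kept pixels are all assumptions made for the example.

```python
import numpy as np

def blockwise_sad(patch, template, block=4, block_threshold=0.3):
    """Mean absolute difference over blocks that do not look like outliers.

    A block is skipped when its mean absolute difference exceeds
    block_threshold (a hypothetical, hand-picked value).
    """
    d = np.abs(patch.astype(np.float64) - template.astype(np.float64))
    h, w = d.shape
    total, used = 0.0, 0
    for y in range(0, h, block):
        for x in range(0, w, block):
            cell = d[y:y + block, x:x + block]
            if cell.mean() <= block_threshold:  # keep blocks that look like inliers
                total += cell.sum()
                used += cell.size
    # Normalize by the pixels kept so that candidates with different
    # numbers of excluded blocks remain comparable.
    return total / used if used else np.inf
```

Even in this reduced form the score needs a per-block decision for every candidate position, which hints at why such schemes cost more than a plain SAD.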



Kalal et al. (2010b) track the points of the template back and forth
between the previous frame and the current frame and validate the
detection. This method enables trackers to avoid tracking points that
disappear from the camera view or change appearance drastically.
Before our work, Kalal's tracker was the state of the art.







Figure 1: (Top) The red box is the SMR tracker's output,
the blue box is the SAD tracker's output. The ground truth
from the first frame is used as the template, which is shown
in the top-left corner of the frame. (Bottom) The absolute
differences for each pixel between the template and the result
from the SMR tracker are mapped on the left, and those from the
SAD tracker on the right. Dark values (close to zero) indicate
a better match. Note that even though its differences are higher
overall, the SMR tracker is able to find the correct patch.

3 SIMILARITY MATCHING RATIO (SMR) TRACKER

The SMR tracker uses a modified template-matching algorithm. In this
algorithm, we look for similarity between a template G(x,y) and
patches F(x + h_1, y + h_2) of a new video frame. The template is moved
over the new video frame convolutionally, in steps of one pixel, and
the SMR computes the difference between the template and the patch at
each pixel. If this difference is lower than a threshold, it is
converted with a negative exponential and added to the output. This
thresholding eliminates outlying pixels, so that they do not
contribute to the final output. The SMR algorithm is as follows:

1. The search area, (h_1, h_2), is limited to the neighborhood of the
target's previous position.

2. For each pixel in the template G(x,y), the method checks whether
the condition F(x + h_1, y + h_2) - G(x,y) < a is satisfied, where a is
a dynamic threshold defined in step 6.

3. If it is satisfied, we are interested in how close the match is,
so the pixel difference is converted into a probability value p by
p = exp(-|F(x + h_1, y + h_2) - G(x,y)|). If not, the pixel is ignored.

4. The probability values are summed up for each patch. The algorithm
finds the (h_1, h_2) that gives the highest similarity matching ratio,
argmax_{(h_1, h_2)} Σ p.

5. G(x,y)_{t+1} = F(x + h_1, y + h_2)_t. The patch is extracted at
every detection and assigned as the new template.

6. Dynamic threshold a = max(G(x,y)_t - G(x,y)_{t+1}) · k, where
k = 0.25 is a constant determined experimentally.

Figure 2: Histograms of the pixel differences that were mapped in Figure 1: (a) between the template and the result from
the SMR tracker and (b) between the template and the result from the SAD tracker. The SAD tracker minimizes the number of
pixels with large differences, whereas the SMR tracker maximizes the number of pixels that have small differences.
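
The six steps of the SMR algorithm can be sketched in Python with NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `radius` of the search area, the use of absolute differences in steps 2 and 6, the caller-supplied initial threshold `a`, and the optional floor `a_min` are all assumptions that the paper does not specify.

```python
import numpy as np

def smr_score(patch, template, a):
    """Steps 2-4: sum exp(-|d|) over the pixels whose difference is below a."""
    d = np.abs(patch.astype(np.float64) - template.astype(np.float64))
    return np.exp(-d[d < a]).sum()  # pixels with d >= a contribute nothing

def smr_step(frame, template, prev_xy, a, radius=10, k=0.25, a_min=0.0):
    """One tracking step on a 2-D grayscale frame.

    prev_xy is the previous top-left corner (x, y). Returns the new corner,
    the new template and the new dynamic threshold.
    """
    th, tw = template.shape
    fh, fw = frame.shape
    px, py = prev_xy
    best_score, best_xy = -1.0, prev_xy
    # Step 1: search only in the neighborhood of the previous position.
    for y in range(max(0, py - radius), min(fh - th, py + radius) + 1):
        for x in range(max(0, px - radius), min(fw - tw, px + radius) + 1):
            score = smr_score(frame[y:y + th, x:x + tw], template, a)
            if score > best_score:  # step 4: keep the highest matching ratio
                best_score, best_xy = score, (x, y)
    x, y = best_xy
    # Step 5: the detected patch becomes the next template.
    new_template = frame[y:y + th, x:x + tw].copy()
    # Step 6: dynamic threshold from the change between consecutive templates.
    change = np.abs(new_template.astype(np.float64) - template.astype(np.float64))
    # a_min is an assumed safeguard: an unchanged template would otherwise
    # give a threshold of zero, so that no pixel could pass step 2.
    new_a = max(k * change.max(), a_min)
    return best_xy, new_template, new_a
```

A tracker would call `smr_step` once per frame, feeding the returned position, template and threshold into the next call. The intensity scale matters for `exp(-|d|)`; the sketch leaves the choice of scale to the caller.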

The biggest advantage of the SMR is that pixel differences above the
threshold a do not contribute to the matching similarity output.
These pixels may be outliers or points that dramatically change
appearance, and thus should not affect the matching similarity.
Outlying pixels usually only increase the error and cause failure,
so we chose to ignore them in this method. This way only reliably
matching pixels contribute to the output of each matching step.



4 RESULTS 

We tested this approach on a challenging benchmark: the TLD dataset
(Kalal et al., 2010a). From this dataset six videos with different
properties were selected, as displayed in Table 1. Each video contains
only one target. The metric used is the number of correctly tracked
frames. For this test color videos were converted to grayscale.
State-of-the-art performance was achieved, and the results are
presented in Table 2.

To illustrate how the SMR tracker's qualitatively different
definition of the tracking problem provides better results than the
traditional approach, we compare the SMR tracker with the SAD tracker
in this section.

Figure 1 shows the detections from the SAD tracker and the SMR
tracker when both use the same template. Points that dramatically
changed appearance cause the SAD tracker to fail, whereas the SMR
tracker correctly detects the object. For illustration purposes, the
differences for each pixel between the template and the patches
detected by the SAD tracker and the SMR tracker are mapped in
Figure 1 (bottom). The patch the SMR tracker detected has a larger sum
of absolute differences; however, that is because of the region that
dramatically changed appearance. That patch has many close matches
with the template, as can be seen in Figure 2, and so the SMR tracker
is able to detect it. By the same principle, the SMR tracker is able
to track the object when it is going out of the scene, as shown in
Figure 3.
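
A toy example makes this difference between the two criteria concrete. The numbers below are invented for illustration and are not taken from the benchmark; the helper names `sad` and `smr` and the threshold value are assumptions.

```python
import numpy as np

def sad(patch, template):
    """Sum of absolute differences: lower is better."""
    return np.abs(patch - template).sum()

def smr(patch, template, a):
    """Similarity matching ratio: higher is better."""
    d = np.abs(patch - template)
    return np.exp(-d[d < a]).sum()

template = np.full(10, 0.2)

# Candidate on the object: 7 pixels unchanged, 3 pixels changed drastically.
on_object = template.copy()
on_object[:3] = 1.0

# Candidate on the background: every pixel is moderately off.
on_background = template + 0.2

a = 0.1  # hypothetical threshold
print("SAD (object, background):", sad(on_object, template), sad(on_background, template))
print("SMR (object, background):", smr(on_object, template, a), smr(on_background, template, a))
```

The three changed pixels give the on-object candidate the larger SAD (roughly 2.4 against 2.0), so a SAD tracker would choose the background. Under SMR the seven exact matches score about 7 while the background candidate scores 0, so the object is selected.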




Figure 3: The red boxes are the SMR tracker's outputs. The
video frame is extended and padded with zeros. The SMR
tracker is able to track when the target is going out of the
frame. The template update is suspended in these situations,
which prevents drifting away from the object.
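
The frame extension and the suspended template update described for Figure 3 could be sketched as below. This is one possible reading, not the authors' code: the helper names, the `margin` parameter and the rule "update only when the box lies fully inside the original frame" are assumptions.

```python
import numpy as np

def pad_frame(frame, margin):
    """Extend a grayscale frame by `margin` zero-valued pixels on every side."""
    return np.pad(frame, margin, mode="constant", constant_values=0)

def box_inside_original(xy, box_hw, frame_hw, margin):
    """True if the box lies fully inside the unpadded frame.

    xy is the top-left corner (x, y) in padded coordinates, box_hw is
    (height, width) of the box and frame_hw is (height, width) of the
    original frame.
    """
    x, y = xy
    bh, bw = box_hw
    fh, fw = frame_hw
    return (margin <= x and x + bw <= margin + fw and
            margin <= y and y + bh <= margin + fh)
```

In a tracking loop the search would run on the padded frame, and the template would be replaced only when `box_inside_original` returns True, so that zero padding never enters the template.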




Figure 4: (Top) The red boxes are the SMR tracker's outputs. (Bottom) The blue boxes are the SAD tracker's outputs. 
Outlying pixels cause the SAD tracker to drift, whereas the SMR tracker is not affected by them. 



The SMR tracker is more robust to outliers than the traditional
approach. As can be seen in Figure 4, outliers cause the SAD tracker
to drift away from the object, whereas the SMR tracker finds the
target. Ideally the bounding box should be entirely filled with the
target. However, during long-term tracking, the object may move back
and forth and rotate, which causes some background pixels to be
included in the next template. A tracker does not know which pixels
belong to the object and which belong to the background. The SMR
tracker, however, has a higher probability of rejecting background
pixels, as they tend to change more.

From the 2nd frame to the 3rd in Figure 4 (bottom), the SAD tracker
drifts away from the object, because pixels from the background have
become included in the bounding box and propagate to the template.
When the face moves right, the SAD tracker does not follow it, because
the high-contrast background gives large differences if the bounding
box shifts to a new position. The traditional approach thus gives
priority to avoiding large differences when it makes a decision, even
if the pixels that cause them are not the majority of the template.
The SMR tracker, on the other hand, focuses on the number of pixels
that have small differences from the template, which here belong to
the face (Figure 4, top).



5 FAILURE MODE

Even though the SMR tracker updates the template at every frame in
this work, drifts caused by the accumulation of small errors at each
detection were not observed when applying the method to the benchmark
dataset. However, when an object becomes occluded very slowly,
updating the template at every frame causes the template to include
foreground pixels that do not belong to the object. An example can be
seen in Figure 5. A better template update mechanism would prevent
this kind of failure. This will most probably require the use of a
classifier, which is outside the scope of this paper.

Figure 5: Red boxes are the SMR tracker's results. The
every-frame template update causes the outlying pixels to
propagate to the templates. When outlying pixels dominate
the template, the SMR tracker fails.



6 CONCLUSION 

This paper proposes a novel approach to tracking: the Similarity
Matching Ratio (SMR). The SMR tracker is more robust to outliers than
the traditional approaches because it does not collect differences
between the template and the frame for every pixel. Instead, it
collects probabilities from the pixels that have small differences
from the template. The SMR tracker tries to find the region that
maximizes the good matches rather than the one that minimizes the
differences over the whole template. This proves to be a superior
approach: the SMR tracker was tested on challenging video sequences
and achieved state-of-the-art performance (see Table 2).